An analytic perspective on Income, Race and Drug use.#
Kiefer Plender, Remolo van de Plassen, Ouail Moukthari, Huub Al
Disclaimer: The Plotly plots in our Jupyter Book do not have annotations within the graph. This is on purpose: the responsive nature of the Plotly library made the captions of our figures inconsistent across screen sizes. Instead, we used markdown for the figure captions.
1.1 Introduction#
Drug abuse is a hard and intricate issue affecting large parts of modern society. Stepping away from bias and stereotypes, our data story aims to provide some clear, yet distinct, views on drug abuse. We present two different perspectives on drug abuse in order to give a wide view of the topic.
Our first perspective investigates whether individuals who belong to a racial minority group are more likely to abuse illicit drugs. It follows the narrative that these people might face more challenges in day-to-day life, such as financial problems or fewer job opportunities (Darity Jr., W. A., Hamilton, D., & Dietrich, J. (2018)). Due to the nature of drugs (specifically downers), we think these people might pick up drug habits to deal with these problems earlier than more well-off individuals and/or people of other races. Since this perspective is centered around culture, we also look into stereotypical gender roles and character traits that might influence drug use.
The second perspective suggests a broader view of the overall topic. It states that drug use is a universal problem and factors like race or income do not play a direct role. Individuals with lower incomes may be more vulnerable to drug abuse, but low income isn’t the only factor that contributes to this statistic. Our data study relies on the notion that we can attribute the issue to more general factors, like peer pressure or general sensitivity to addiction.
When reviewing these two perspectives, we aim to present a more nuanced view on drug abuse and its victims. Challenging the current stereotypes and stigmas associated with drug abuse can create a society that is educated and supports victims affected by this issue (Livingston, Milne, Fang, & Amari, 2012).
1.2 Dataset and preprocessing#
In pursuit of providing a clear overview, we decided to use two datasets. At first we only had one, but after examining our results for the second perspective we figured it would be a good idea to back these findings up even more by utilising another dataset.
The first is a large dataset from the 2015 National Survey on Drug Use and Health, conducted by the US government. The survey captures a representative view of the adult population of the USA. Fortunately, the dataset was very clean and didn’t require much preprocessing to be usable. However, because it is survey data, the answers are stored as numeric codes that had to be translated to their real-world meaning. We used the legend (codebook) of the survey for this. For example, we converted variables like sex, which has a value of 1 or 2, to the corresponding nominal values ‘Male’ or ‘Female’. Apart from this translation step, no preprocessing was needed to create the figures.
The second one is a slightly smaller dataset from the National Health and Nutrition Examination Survey 2017-2018, also conducted by the US government. This is a broader survey covering all sorts of health-related topics, such as diet. Around 10,000 people took part, but not everyone answered every part of the survey. For the drug-use part there were only around 1500 records of usable data for our story. Because each part of the survey is published as a separate dataset, we combined these records with the records from the income part. Records that had no information about drug use were removed. The following code block shows the criteria we used to remove records:
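The recoding step described above can be sketched as follows. This is a minimal illustration on made-up rows, assuming the codebook maps 1 to ‘Male’ and 2 to ‘Female’ as stated in the text; the `sex_labels` dictionary and the `sex_str` column name are hypothetical and not taken from our notebook.

```python
import pandas as pd

# Toy rows standing in for the survey data (not real NSDUH records).
toy = pd.DataFrame({'sex': [1, 2, 2, 1]})

# Hypothetical lookup table built from the codebook: numeric code -> label.
sex_labels = {1: 'Male', 2: 'Female'}

# Series.map replaces each code with its label. Codes that are missing from
# the dictionary would become NaN, which makes unexpected values easy to spot.
toy['sex_str'] = toy['sex'].map(sex_labels)
print(toy)
```

Keeping the original numeric column next to the labelled one makes it possible to check the translation against the codebook later.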
import pandas as pd
# Read the SAS transport (.xpt) files.
df_drugs = pd.read_sas('DUQ_J.xpt')
df_income = pd.read_sas('INQ_J.xpt')
# Drop the records with a missing answer (99) and the respondents who refused to answer (77).
income_known = df_income[df_income['IND235'] != 99]
income_known = income_known[income_known['IND235'] != 77]
# Drop records that did not have information about drug use.
druguse_known = df_drugs[df_drugs['DUQ200'] != 2]
druguse_known = druguse_known.dropna(subset=['DUQ200'])
# Combine the two datasets on the respondent sequence number.
merged_df = druguse_known.merge(income_known, on='SEQN')
# Write to csv.
#merged_df.to_csv('NHANES1718.csv', index=False)
We had to consult the data dictionary for this survey to give meaning to, for example, the binning within the dataset. Every column in the dataset is named with a code that corresponds to a question in the survey, which made a quick interpretation of the dataset quite bothersome. Fortunately, the documentation of the survey was really well structured and answered all the questions we had.
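Because not every respondent answered every part of the survey, it can be useful to see how many records are lost when two questionnaire files are combined. The sketch below is not part of our notebook: it uses invented toy values with the column names `SEQN`, `DUQ200` and `IND235`, and the variable names `audit`, `merge_counts` and `inner` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the two questionnaire files (values are invented).
toy_drugs = pd.DataFrame({'SEQN': [1, 2, 3], 'DUQ200': [1, 1, 1]})
toy_income = pd.DataFrame({'SEQN': [2, 3, 4], 'IND235': [5, 8, 2]})

# An outer merge with indicator=True adds a '_merge' column that records
# whether each respondent appears in both files or in only one of them.
audit = toy_drugs.merge(toy_income, on='SEQN', how='outer', indicator=True)
merge_counts = audit['_merge'].value_counts().to_dict()
print(merge_counts)

# The default inner merge keeps only respondents present in both files.
inner = toy_drugs.merge(toy_income, on='SEQN')
print(len(inner))
```

Running such a check before the inner merge shows how many respondents are dropped because they appear in only one of the files.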
1.3 Visualisations#
Import of packages#
import plotly.express as px
import plotly.graph_objects as go
import numpy as np
Perspective 1#
First visualisation (Bar Plot: Drug usage by race and sex) - NSDUH#
This bar chart shows the average drug usage rate grouped by race and sex. The y-axis gives drug usage in percent and the x-axis the race groups; each race group is split further by sex, which in this dataset is either Male or Female. We split by sex because gender can itself be a source of minority status or prejudice. Some races clearly have higher overall drug usage, but that is not the main takeaway of this plot. The main interest is the ratio of male to female drug users. For the Asian and Mixed groups there is little difference between the sexes, but for the Black/African American group the difference is large. These findings are in line with what is known about gender differences in substance abuse (Lambert, Brown, Phillips, & Ialongo, 2004). It is not uncommon for African American male adolescents to fall victim to peer pressure more often than females. Another reason is the role family dynamics play in preventing drug abuse among females, such as helping to raise children and setting an example for them. Many more factors may contribute to this difference. Notably, males have higher drug use rates in every race group. The explanation is complicated, but differences in stereotypical gender roles could play a part here too, as with the larger gap between the sexes for African Americans.
df = pd.read_csv('nsduh_workforce_adults.csv')
# Mean of the 0/1 indicator per race and sex = share of respondents who ever used drugs.
df_grouped = df.groupby(['race_str', 'sex'])['anydrugever'].mean().reset_index()
df_grouped.sort_values('race_str', inplace=True)
# Sex is coded as 1 = Male, 2 = Female.
male_df = df_grouped[df_grouped['sex'] == 1]
female_df = df_grouped[df_grouped['sex'] == 2]
# Take the x values from each subset so that every bar stays aligned with its own race label.
trace1 = go.Bar(x=male_df['race_str'], y=male_df['anydrugever'].values * 100, name='Male')
trace2 = go.Bar(x=female_df['race_str'], y=female_df['anydrugever'].values * 100, name='Female')
layout = go.Layout(
    title='Drug Usage by Race and Sex',
    xaxis=dict(title='Race'),
    yaxis=dict(title='Drug Usage (%)', dtick=10),
    barmode='group'
)
fig = go.Figure(data=[trace1, trace2], layout=layout)
fig.show()
Second visualisation (Heat map: Percentage of Drug Use (Ever) by Race) - NSDUH#
This plot shows the percentage of people of different ethnicities who have ever used a certain type of drug. The y-axis lists the ethnicities and the x-axis the types of drugs. The plot shows that marijuana is by far the drug that most people have tried, and that crack and heroin are the drugs that the fewest people have used. Native Americans have the highest share of all races for several types of drugs: cocaine, crack, hallucinogens, inhalants, meth, and tranquilizers. According to a medically reviewed article by American Addiction Centers, this is a well-known problem among Native Americans. It could potentially be explained by historical trauma, violence (including high levels of gang violence, domestic violence, and sexual assault), poverty, high levels of unemployment, discrimination, racism, lack of health insurance, or low levels of attained education (Kaliszewski, M. (2022)). Another finding is that Asian people have tried far fewer drugs than other races.
# File name written in lower case to match the other cells (matters on case-sensitive file systems).
df = pd.read_csv('nsduh_workforce_adults.csv')
variables = ['marij_ever', 'cocaine_ever', 'crack_ever', 'heroin_ever', 'hallucinogen_ever',
             'inhalant_ever', 'meth_ever', 'painrelieve_ever', 'tranq_ever', 'stimulant_ever']
full_names = {
    'marij_ever': 'Marijuana',
    'cocaine_ever': 'Cocaine',
    'crack_ever': 'Crack',
    'heroin_ever': 'Heroin',
    'hallucinogen_ever': 'Hallucinogen',
    'inhalant_ever': 'Inhalant',
    'meth_ever': 'Methamphetamine',
    'painrelieve_ever': 'Pain Reliever',
    'tranq_ever': 'Tranquilizer',
    'stimulant_ever': 'Stimulant'
}
# Number of respondents per race, and number of them who ever used each drug.
total_counts = df['race_str'].value_counts()
counts = df.groupby('race_str')[variables].sum()
counts = counts.rename(columns=full_names)
# Percentage of each race group that ever used each drug.
proportions = counts.div(total_counts, axis=0) * 100
proportions = proportions.round(2)
fig = px.imshow(proportions, labels=dict(x="Type of drug", y="Race", color="Percentage"),
                title="Percentage of Drug Use (Ever) by Race", color_continuous_scale='YlOrRd',
                zmin=0, zmax=100)
# Write the percentage into each cell of the heat map.
annotations = []
for i in range(len(proportions)):
    for j in range(len(proportions.columns)):
        annotations.append(dict(
            x=j,
            y=i,
            text=str(proportions.iloc[i, j]) + '%',
            showarrow=False,
            font=dict(color='black', size=8)
        ))
fig.update_layout(annotations=annotations)
fig.update_xaxes(side="top")
fig.show()
Perspective 2#
Third visualisation (Correlation Plot: Income, Education, and Drugs) - NSDUH#
Beforehand, we expected that people with lower incomes would be more likely to have used different types of drugs because of their economic and social circumstances. We also expected a correlation between employment status, sick leave and drug use. However, the correlation plot based on our data tells a different story. At first, we only looked at the correlation between ‘Ever Used Drugs’ and ‘Personal Income’, ‘Family Income’, ‘Education’, ‘Skip When Sick’, and ‘Employment Status’, and soon found that there was no correlation. We therefore added ‘Different Drugs This Year’ and ‘Different Drugs This Month’ to check whether our initial expectations would hold for more recent use. As can be seen from the correlation plot, there is no clear correlation between the variety of drug use and the socio-economic factors.
import plotly.figure_factory as ff
df = pd.read_csv('nsduh_workforce_adults.csv')
drug_related_columns = ['countofdrugs_month', 'countofdrugs_year', 'countofdrugs_ever']
other_columns = ['PersonalIncome', 'FamilyIncome', 'SkipSick', 'education', 'EmploymentStatus']
custom_drug_related_names = ['Different Drugs This Month', 'Different Drugs This Year', 'Ever Used Drugs']
custom_other_names = ['Personal Income', 'Family Income', 'Skip When Sick', 'Education', 'Employment Status']
# Keep only the block of the correlation matrix that relates drug use (rows) to the other factors (columns).
partial_corr_matrix = df[drug_related_columns + other_columns].corr().loc[drug_related_columns, other_columns]
fig = ff.create_annotated_heatmap(
    z=partial_corr_matrix.values,
    x=custom_other_names,
    y=custom_drug_related_names,
    annotation_text=partial_corr_matrix.round(2).values,
    showscale=True
)
fig.update_layout(
    title='Correlation Matrix: income, skip when sick, education and employment status and drug use',
    margin=dict(l=200, r=200, t=100, b=100)
)
fig.show()
# Helped by the GPT-4 prompt: Make a correlation matrix based on the following dataset (which was provided)
# for the variables: 'countofdrugs_month', 'countofdrugs_year', 'countofdrugs_ever',
# 'personalincome', 'familyincome', 'skipsick', 'education', 'employmentstatus'
It was not surprising to find only limited correlations between employment status and the diversity of drug use. Employment status is coded as a category (1 = Full Time, 2 = Part Time, 3 = Unemployed), so a correlation coefficient computed on these codes is hard to interpret. Correlations tend to be more prominent for continuous variables, such as income or age.
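To illustrate why a correlation coefficient on category codes can be misleading, here is a small sketch on deliberately constructed toy data. The numbers are invented and say nothing about the actual survey; only the column names `EmploymentStatus` and `countofdrugs_ever` are borrowed from the dataset. In this toy case the middle category differs from the other two, yet the Pearson coefficient on the codes is zero.

```python
import pandas as pd

# Invented toy data: the middle category (2 = Part Time) has a higher value
# than the two outer categories, so there is no linear trend in the codes.
toy = pd.DataFrame({
    'EmploymentStatus': [1, 1, 2, 2, 3, 3],
    'countofdrugs_ever': [1, 1, 4, 4, 1, 1],
})

# Pearson correlation treats the category codes as if they were measurements.
pearson_r = toy['EmploymentStatus'].corr(toy['countofdrugs_ever'])

# Comparing the mean per category does not rely on the order of the codes.
group_means = toy.groupby('EmploymentStatus')['countofdrugs_ever'].mean()

print(pearson_r)
print(group_means)
```

For coded categories, a per-group comparison like this can be a useful complement to a correlation matrix.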
Fourth visualisation (Parallel Categories Plot: Income, Education, and Drugs) - NSDUH#
We had hoped for a more distinct visualization of the correlation between income and drug use using a parallel categories plot. Our intention was to visually represent the most prevalent combinations of socio-economic factors, such as education and income, that contribute to higher drug usage. The initial plot aimed to display the combinations of factors for all individuals, but it lacks significance since it is evident that the majority of people do not use drugs extensively. The bins in the graph were defined as follows: Low = 0-3 different drugs ever used, medium = 4-6 drugs, and high = 7+.
In the second graph, only the high and medium groups are depicted, providing the desired visualization. However, similar to the previous section, the combinations of variables leading to a wide range of drug usage appear to be evenly distributed and unrelated to income or education. An interesting finding is that a slightly larger proportion of above-average drug users seems to come from affluent families. This observation could be an inaccurate representation of the real world due to data filtering, but there might be underlying explanations, such as unforeseen effects of nepotism or the neglect of some children from wealthy households. Although these are speculative possibilities that cannot be inferred from the available data, they present intriguing opportunities for further research.
df = pd.read_csv('nsduh_workforce_adults.csv')
# Keep only the columns needed for the plots.
columns = ['race_str', 'PersonalIncome', 'education', 'countofdrugs_ever', 'FamilyIncome']
df = df[columns].copy()
# Bin the number of different drugs ever used into three equal-width bins.
# Note: despite the variable names, this uses pd.cut (equal width), not pd.qcut (equal frequency).
df['amount_drugs_qcut'], qcut_bins = pd.cut(df['countofdrugs_ever'], bins=3, labels=['Low', 'Medium', 'High'], retbins=True)
print("Bins for cut:", qcut_bins)
# Keep only the rows with medium or high drug use.
df_filtered = df[df['amount_drugs_qcut'].isin(['Medium', 'High'])]
# Parallel categories plot for all respondents.
parcatsall = go.Figure(data=[go.Parcats(dimensions=[
        {'label': 'Personal Income', 'values': df['PersonalIncome'], 'categoryorder': 'category ascending'},
        {'label': 'Education', 'values': df['education'], 'categoryorder': 'category ascending'},
        {'label': 'Family Income', 'values': df['FamilyIncome'], 'categoryorder': 'category ascending'},
        {'label': 'Drug Use', 'values': df['amount_drugs_qcut']},
    ],
    line={'color': df['amount_drugs_qcut'].map({'Low': 'lightblue', 'Medium': 'lightgreen', 'High': 'orangered'})},
    labelfont={'size': 12},
    tickfont={'size': 12},
    arrangement='freeform'
)],
    layout={'title': 'Analysis of Income, Education, and Drug Use'})
parcatsall.show()
# Parallel categories plot for the medium and high groups only.
parcats = go.Figure(data=[go.Parcats(dimensions=[
        {'label': 'Personal Income', 'values': df_filtered['PersonalIncome'], 'categoryorder': 'category ascending'},
        {'label': 'Education', 'values': df_filtered['education'], 'categoryorder': 'category ascending'},
        {'label': 'Family Income', 'values': df_filtered['FamilyIncome'], 'categoryorder': 'category ascending'},
        {'label': 'Drug Use', 'values': df_filtered['amount_drugs_qcut']},
    ],
    line={'color': df_filtered['amount_drugs_qcut'].map({'Medium': 'lightgreen', 'High': 'orangered'})},
    labelfont={'size': 12},
    tickfont={'size': 12},
    arrangement='freeform'
)],
    layout={'title': 'Analysis of Income, Education, and Drug Use'})
# Show plot
parcats.show()
Bins for cut: [-0.01 3.33333333 6.66666667 10. ]
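The bin edges printed above come from equal-width binning, which is what `pd.cut` does when it is given a number of bins. The sketch below contrasts this with quantile-based binning (`pd.qcut`) on a small invented series; the values and the names `equal_width`, `equal_freq`, `cut_sizes` and `qcut_sizes` are assumptions made for illustration and are not part of our notebook.

```python
import pandas as pd

# Invented counts of different drugs ever used, ranging from 0 to 10.
counts = pd.Series([0, 1, 2, 2, 3, 4, 4, 6, 9, 10])
labels = ['Low', 'Medium', 'High']

# pd.cut with bins=3 splits the value range into three intervals of equal width.
equal_width, edges = pd.cut(counts, bins=3, labels=labels, retbins=True)

# pd.qcut with q=3 chooses the edges so that each bin holds roughly the same
# number of observations.
equal_freq = pd.qcut(counts, q=3, labels=labels)

cut_sizes = equal_width.value_counts().to_dict()
qcut_sizes = equal_freq.value_counts().to_dict()
print("Equal-width edges:", edges)
print("Equal-width bin sizes:", cut_sizes)
print("Quantile bin sizes:", qcut_sizes)
```

With equal-width bins the labels keep a fixed meaning in terms of the number of drugs, at the cost of uneven group sizes. That matches the observation that most respondents end up in the Low group.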
Fifth Visualization (Sunburst Graph: Crack Use and Education Level) - NSDUH#
This plot shows the relation between crack and heroin use and education level. Crack and heroin are among the most addictive drugs on the market (Editorial Staff, 2023), which makes them interesting to compare across education levels. The plot shows that people with a college degree tend to have used less crack and heroin than people with other levels of education (only 1.4 percent used crack and only 0.8 percent used heroin, compared to 4.8 percent and 3.1 percent for people with no high school education).
This might indicate that higher educated people use fewer drugs that are associated with addiction than lower educated people. This becomes clearer in the second plot, which only distinguishes between higher and lower educated (high school or less counts as lower, everything above high school as higher). There, 2.6 percent of higher educated people have used crack, compared to 4.7 percent of lower educated people. The same goes for heroin, which 1.6 percent of higher educated people have used, compared to 2.8 percent of lower educated people.
The difference in percentage is relatively small so this is probably not an indication that there is a strong correlation between education and higher-addictive drug use.
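Percentages like the ones quoted above can also be computed directly as a table, independent of any plot. The following is a minimal sketch on invented toy rows; only the column names `education_simple` and `crack_ever` mirror the ones used in this story, and the variable `share` is hypothetical.

```python
import pandas as pd

# Invented toy rows: 1 = ever used crack, 0 = never used.
toy = pd.DataFrame({
    'education_simple': ['Lower'] * 4 + ['Higher'] * 4,
    'crack_ever': [1, 1, 0, 0, 1, 0, 0, 0],
})

# normalize='index' divides every row of the cross table by its row total,
# so each row gives the distribution within one education level.
share = pd.crosstab(toy['education_simple'], toy['crack_ever'], normalize='index') * 100
print(share.round(1))
```

Each row of `share` sums to 100, and the column for the value 1 holds the percentage of that education level that has ever used the drug.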
from plotly.subplots import make_subplots

def create_sunburst_plot(df, education_col, drug_cols, subplot_titles):
    # Create a subplot figure with 1 row and 2 columns
    fig = make_subplots(rows=1, cols=2, specs=[[{"type": "sunburst"}, {"type": "sunburst"}]])
    # Hover template showing each segment as a percentage of its parent
    hover_template = '<b>%{label}:</b> %{percentParent:.1%}'
    # Add one sunburst per drug column to the figure
    for i, drug_col in enumerate(drug_cols):
        fig_sunburst = px.sunburst(df, path=[education_col, drug_col])
        fig.add_trace(go.Sunburst(fig_sunburst.data[0], hovertemplate=hover_template), row=1, col=i+1)
    # Add titles for the subplots
    annotations = [
        dict(text=subplot_titles[0], x=0.205, y=-0.15, xref='paper', yref='paper', showarrow=False, font=dict(size=15)),
        dict(text=subplot_titles[1], x=0.8, y=-0.15, xref='paper', yref='paper', showarrow=False, font=dict(size=15))
    ]
    # Update the layout and show the figure
    fig.update_layout(title_text=f"Drug Use Across {education_col} Education Levels",
                      grid={"rows": 1, "columns": 2},
                      annotations=annotations,
                      legend=dict(
                          orientation="h",
                          yanchor="bottom",
                          y=1.02,
                          xanchor="right",
                          x=1
                      ))
    fig.show()

# Read the data
df = pd.read_csv('nsduh_workforce_adults.csv')
# Define the mapping for drugs and educations
drug_labels = {1: 'Used', 0: 'Did Not Use'}
# First Plot
education_labels_detailed = {1: "No high school", 2: "High school", 3: "Associate degree", 4: "College degree"}
df['education_detailed'] = df['education'].map(education_labels_detailed)
df['crack_ever_bool'] = df['crack_ever'].map(drug_labels)
df['heroin_ever_bool'] = df['heroin_ever'].map(drug_labels)
create_sunburst_plot(df, 'education_detailed', ['crack_ever_bool', 'heroin_ever_bool'], ['Crack', 'Heroin'])
# Second Plot
education_labels_simple = {1: "Lower", 2: "Lower", 3: "Higher", 4: "Higher"}
df['education_simple'] = df['education'].map(education_labels_simple)
create_sunburst_plot(df, 'education_simple', ['crack_ever_bool', 'heroin_ever_bool'], ['Crack', 'Heroin'])
Sixth Visualisation (Bubble Plot: Marijuana use last 30 days and family income) - NHANES#
Because the NSDUH (2015) survey was primarily designed to capture the diversity of drug use rather than its quantity, we speculated that the absence of correlations could be attributed to this. To address this concern, we turned to an alternative dataset from NHANES (n.d.), which includes information about the frequency of drug use, to rule out any misunderstanding about the connection between income and drug use. The graph below illustrates the lack of a correlation between household income and, in this case, the number of days on which the respondent smoked marijuana in the last month.
data = pd.read_csv('NHANES1718.csv')
# Calculate the frequency of combinations
combination_freq = data.groupby(['DUQ230', 'IND235']).size().reset_index(name='Frequency')
# Create a bubble plot using Plotly
fig = px.scatter(combination_freq, x='DUQ230', y='IND235', size='Frequency')
# Set the axis labels
fig.update_xaxes(title='Amount of days marijuana was smoked')
fig.update_yaxes(title='Household income')
# Show the plot
fig.show()
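A bubble plot shows the lack of a pattern visually. One way to put a number on it, which we did not do in this story, would be a rank correlation, since the income variable is a bracket code rather than an amount. The sketch below computes Spearman's coefficient as the Pearson correlation of the ranks. The values are invented purely to demonstrate the calculation and do not reflect the NHANES data; only the column names `IND235` and `DUQ230` are taken from the dataset.

```python
import pandas as pd

# Invented toy values: an ordinal income bracket and days of use in the last month.
toy = pd.DataFrame({
    'IND235': [1, 2, 3, 4, 5],
    'DUQ230': [1, 2, 4, 8, 30],
})

# Pearson correlation measures linear association between the raw values.
pearson_r = toy['IND235'].corr(toy['DUQ230'])

# Spearman correlation is the Pearson correlation of the ranks, so it only
# depends on the ordering of the values, which suits ordinal bracket codes.
spearman_r = toy['IND235'].rank().corr(toy['DUQ230'].rank())

print(pearson_r, spearman_r)
```

In this constructed example the ordering of the two columns is identical, so the rank correlation is at its maximum while the Pearson coefficient is lower. On the real data, a coefficient close to zero would support what the bubble plot suggests.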
An Overview#
In this data story, we presented two distinct perspectives on the relationship between socio-economic factors and drug use. The first perspective explored the impact of cultural influences, race, and gender roles on drug abuse. The second perspective focused on visualizing the connection between income or education and drug use, with the anticipation of observing a stronger correlation. However, the data contradicted our expectations. The only instance where our expectations aligned with the data was in crack and heroin users across different education levels.
It is crucial to emphasize the difference between our expectations and the actual outcomes, as our initial assumptions were likely based on faulty stereotypes. While this simple data story does not completely rule out the possibility of a correlation between these factors, the fact that two separate datasets led us to the same conclusion is a significant finding.
We hope that this data story provides valuable insights into the topic, demonstrating the importance of questioning preconceived notions and highlighting the complexities involved in understanding the relationship between socio-economic factors and drug use.
Reflection#
Since our draft was already nearly a finished product the feedback we received was quite limited. The feedback was the following:
- Expand perspective one to include difference between sexes as well.
- Get rid of the insignificant right half of the correlation plot.
- Add percentages to the sunburst graph so it is more easily interpreted.
- Maybe search for more significant correlations, and if those can't be found document really well on the correlations you were looking for but didn't find. You can try using another dataset.
The first three points were really easy to fix, but the fourth one was obviously a little more difficult. We tried different variables within our old dataset but couldn’t find any correlations. To make sure nothing was left to chance, we added another dataset to our story, which led to the same conclusion. So we decided to explain the correlations we didn’t find as clearly as possible.
In retrospect, the absence of correlations for the second perspective actually improved our data story a lot. This is because we expected a different outcome, and we are probably not the only ones. This data story aimed to step away from bias and stereotypes, and by doing so it more or less debunked one. Overall, the peer feedback helped our data story a lot and improved its quality even further.
Work Distribution#
[To do: Kiefer, please write up the work distribution in easily copyable markdown.]
Appendix#
- Livingston, J. D., Milne, T., Fang, M. L., & Amari, E. (2011). The effectiveness of interventions for reducing stigma related to substance use disorders: a systematic review. Addiction, 106(10), 1786-1796. doi:10.1111/j.1360-0443.2011.03601.x
- Darity Jr., W. A., Hamilton, D., & Dietrich, J. (2018). The Persistent Effect of Race and the Legacy of Slavery on Income Inequality in the United States. Review of Black Political Economy, 45(1), 29-60. doi:10.1007/s12114-017-9250-9
- Lambert, S. F., Brown, T. L., Phillips, C. M., & Ialongo, N. S. (2004). Gender and Ethnic Differences in the Predictors of Drug Use among African American Adolescents. Journal of Youth and Adolescence, 33(5), 373-387. doi:10.1023/B:JOYO.0000032675.06729.f0
- Kaliszewski, M. (2022, September 12). Alcohol and Drug Abuse Among Native Americans. Retrieved from American Addiction Centers.
- OpenAI. (2021). ChatGPT: Language Model. OpenAI.
- Substance Abuse and Mental Health Services Administration. (2016). Results from the 2015 National Survey on Drug Use and Health: Summary of National Findings (NSDUH Series H-51, HHS Publication No. SMA 16-4984). Retrieved from SAMHSA.
- National Center for Health Statistics. (n.d.). National Health and Nutrition Examination Survey (NHANES 2017-2018). Retrieved from CDC - NHANES.
- Editorial Staff. (2023, June 22). What is the Most Addictive Drug? Here Are the Top 5 Substances. Retrieved from American Addiction Centers.